1 Introduction

Lung cancer is the most lethal one in comparison with other cancer types, in both genders. Is the responsible of the 28% in the all cancer mortality. The expects for the ones who have this kind of cancer diagnosed are very poor, Usually is diagnosed in an advanced phase and the patients only have a 16% of survival rate after 5 years of their diagnosis.

The Cancer Genome Atlas (TCGA) has comprehensively profiled this type of cancer in a patient cohort. Here we analyze the expression profiles of those patients, using a pipeline based on the R/Bioconductor software package Rsubread.

This document is written in R markdown and should be processed using R and you need to install the packages knitr and markdown. Moreover, it using the official style for Bioconductor vignettes facilitated by the Bioconductor package BiocStyle. It is compiled from different files following the instructions included in the makefile.

2 Quality assessment

2.1 Data import

Lung cancer, specifically lung squamous cell carcinoma, was assessed and analyzed in this experiment. We start importing the raw table of counts.

library(SummarizedExperiment)

se.unpaired <- readRDS(file.path("rawCounts", "seLUSC.rds"))
se.unpaired
class: RangedSummarizedExperiment 
dim: 20115 553 
metadata(5): experimentData annotation cancerTypeCode
  cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(553): TCGA.18.3406.01A.01R.0980.07
  TCGA.18.3407.01A.01R.0980.07 ... TCGA.90.7767.11A.01R.2125.07
  TCGA.92.7340.11A.01R.2045.07
colData names(549): type bcr_patient_uuid ...
  lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
table(se.unpaired$type)

normal  tumor 
    51    502 

From this, we see that 20115 genes from a total of 553 samples are in this data. We can also observe that this dataset contains 502 tumor samples and 51 normal samples, which matches the information of the dataset from The Cancer Genome Atlas (TCGA) program.

Next, we explored the column (phenotypic) data, which in this case corresponds to clinical variables, and their corresponding metadata. For example, the column “type” consists of tumor and normal labels. Other information such as gender, tumor stage, and smoking status are also shown.

In any case, we can access all the possible labels with the last line of code, which returns two columns of information about the clinical variables. One called labelDescription contains a succint description of the variable, often not more self-explanatory than the variable name itself, and the other called ‘CDEID’ corresponds to the so-called Common Data Element (CDE) identifier. This identifier can be use in https://cdebrowser.nci.nih.gov to search for further information about the associated clinical variable using the Advanced search form and the Public ID attribute search.

dim(colData(se.unpaired))
[1] 553 549
colData(se.unpaired)[1:5, 1:5]
DataFrame with 5 rows and 5 columns
                                 type                     bcr_patient_uuid
                             <factor>                             <factor>
TCGA.18.3406.01A.01R.0980.07    tumor 95b83006-02c9-4c4d-bf84-a45115f7d86d
TCGA.18.3407.01A.01R.0980.07    tumor 4e1ad82e-23c8-44bb-b74e-a3d0b1126b96
TCGA.18.3408.01A.01R.0980.07    tumor d4bc755a-2585-4529-ae36-7e1d88bdecfe
TCGA.18.3409.01A.01R.0980.07    tumor b09e872a-e837-49ec-8a27-84dcdcabf347
TCGA.18.3410.01A.01R.0980.07    tumor 99599b60-4f5c-456b-8755-371b1aa7074e
                             bcr_patient_barcode form_completion_date
                                        <factor>             <factor>
TCGA.18.3406.01A.01R.0980.07        TCGA-18-3406             2011-3-9
TCGA.18.3407.01A.01R.0980.07        TCGA-18-3407             2011-3-9
TCGA.18.3408.01A.01R.0980.07        TCGA-18-3408             2011-3-9
TCGA.18.3409.01A.01R.0980.07        TCGA-18-3409            2011-3-17
TCGA.18.3410.01A.01R.0980.07        TCGA-18-3410             2011-4-4
                             prospective_collection
                                           <factor>
TCGA.18.3406.01A.01R.0980.07                     NO
TCGA.18.3407.01A.01R.0980.07                     NO
TCGA.18.3408.01A.01R.0980.07                     NO
TCGA.18.3409.01A.01R.0980.07                     NO
TCGA.18.3410.01A.01R.0980.07                     NO
mcols(colData(se.unpaired), use.names=TRUE)
DataFrame with 549 rows and 2 columns
                                                         labelDescription
                                                              <character>
type                                           sample type (tumor/normal)
bcr_patient_uuid                                         bcr patient uuid
bcr_patient_barcode                                   bcr patient barcode
form_completion_date                                 form completion date
prospective_collection            tissue prospective collection indicator
...                                                                   ...
lymph_nodes_pelvic_pos_total                               total pelv lnp
lymph_nodes_aortic_examined_count                           total aor lnr
lymph_nodes_aortic_pos_by_he                          aln pos light micro
lymph_nodes_aortic_pos_by_ihc                                 aln pos ihc
lymph_nodes_aortic_pos_total                                total aor-lnp
                                        CDEID
                                  <character>
type                                       NA
bcr_patient_uuid                           NA
bcr_patient_barcode                   2673794
form_completion_date                       NA
prospective_collection                3088492
...                                       ...
lymph_nodes_pelvic_pos_total          3151828
lymph_nodes_aortic_examined_count     3104460
lymph_nodes_aortic_pos_by_he          3151832
lymph_nodes_aortic_pos_by_ihc         3151831
lymph_nodes_aortic_pos_total          3151827

Now, we explore the row (feature) data, which provide information on genes. For the first line of command, we can see the length of each gene and the GC content. The second line of command provides further information, such as the ranges within individual chromosome.

rowData(se.unpaired)
DataFrame with 20115 rows and 3 columns
               symbol     txlen              txgc
          <character> <integer>         <numeric>
1                A1BG      3322 0.564419024683925
2                 A2M      4844 0.488232865400495
9                NAT1      2280 0.394298245614035
10               NAT2      1322 0.389561270801815
12           SERPINA3      3067 0.524942940984676
...               ...       ...               ...
100996331       POTEB      1706 0.430832356389215
101340251    SNORD124       104 0.490384615384615
101340252   SNORD121B        81 0.407407407407407
102724473      GAGE10       538 0.505576208178439
103091865   BRWD1-IT2      1028 0.592412451361868
rowRanges(se.unpaired)
GRanges object with 20115 ranges and 3 metadata columns:
            seqnames            ranges strand |      symbol     txlen
               <Rle>         <IRanges>  <Rle> | <character> <integer>
          1    chr19 58345178-58362751      - |        A1BG      3322
          2    chr12   9067664-9116229      - |         A2M      4844
          9     chr8 18170477-18223689      + |        NAT1      2280
         10     chr8 18391245-18401218      + |        NAT2      1322
         12    chr14 94592058-94624646      + |    SERPINA3      3067
        ...      ...               ...    ... .         ...       ...
  100996331    chr15 20835372-21877298      - |       POTEB      1706
  101340251    chr17 40027542-40027645      - |    SNORD124       104
  101340252     chr9 33934296-33934376      - |   SNORD121B        81
  102724473     chrX 49303669-49319844      + |      GAGE10       538
  103091865    chr21 39313935-39314962      + |   BRWD1-IT2      1028
                         txgc
                    <numeric>
          1 0.564419024683925
          2 0.488232865400495
          9 0.394298245614035
         10 0.389561270801815
         12 0.524942940984676
        ...               ...
  100996331 0.430832356389215
  101340251 0.490384615384615
  101340252 0.407407407407407
  102724473 0.505576208178439
  103091865 0.592412451361868
  -------
  seqinfo: 455 sequences (1 circular) from hg38 genome

In this analysis, we wanted to compare tumor and normal cells. In order to do this, we wanted to compare tumor and normal cells that come from the same patient. This also allows to pair the samples which reduces the possibility of false positives. In order to get them, we execute the following code.

# get the number of occurences of each patient
occur <- data.frame(table(substr(colnames(se.unpaired), 9, 12)))

# get those that occur twice
paired_df <- occur[occur$Freq > 1,]
paired <- as.vector(paired_df$Var1)
paired <- paired[2:length(paired)]

mask <- is.element(substr(colnames(se.unpaired), 9, 12), paired)

se <- se.unpaired[ ,mask]

saveRDS(se, file.path("results", "se.paired.rds"))

Since this data is unprocessed, we must first perform a quality assessment and normalize accordingly. To do this, we need first to load the edgeR R/Bioconductor package. This requires the creation of a `DGEList’ object, as the package doesn’t work directly with SummarizedExperiment objects. In any case, we will update any change in the DGElist object to the SummarizedExperiment object.

library(edgeR)

dge <- DGEList(counts=assays(se)$counts, genes=mcols(se))
saveRDS(dge, file.path("results", "dge.paired.rds"))

One way of normalizing is to use the counts per million (CPM) values. This value is calculated by dividing the number of counts of each sample by 1 million. Instead of using this directly, we calculate \(\log_2\) CPM values of expression because it has nicer distributional properties than raw counts or non-logged CPM units. We set this information as an additional assay element to ease their manipulation.

assays(se)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
assays(se)$logCPM[1:5, 1:5]
   TCGA.22.4593.01A.21R.1820.07 TCGA.22.4609.01A.21R.2125.07
1                      2.433139                     1.760377
2                      9.013463                    10.810625
9                     -6.817220                    -6.817220
10                    -6.817220                    -6.817220
12                     5.605909                     5.647356
   TCGA.22.5471.01A.01R.1635.07 TCGA.22.5472.01A.01R.1635.07
1                      2.122666                     0.775287
2                      7.713750                    11.100735
9                     -6.817220                    -6.817220
10                    -6.817220                    -6.817220
12                     4.823947                     5.856939
   TCGA.22.5478.01A.01R.1635.07
1                      3.713844
2                      9.448174
9                     -6.817220
10                    -6.817220
12                     4.920820

2.2 Sequencing depth

Let’s examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure 1 below shows the sequencing depth per sample, also known as library sizes, in increasing order.

Library sizes in increasing order.

Figure 1: Library sizes in increasing order

This figure reveals no big changes in sequencing depth between samples. In addition, we see a uniform distribution of tumor and normal samples across the figure. We see a cluster of normal samples on the right side with a higher cpm, but don’t believe this will affect our analyses. We can obtain the samples with less sequencing depth using the commands below.

sampledepth <- round(dge$sample$lib.size / 1e6, digits=1)
names(sampledepth) <- substr(colnames(se), 6, 12)
sort(sampledepth)
43.6773 56.7823 77.7335 56.8309 56.8083 58.8386 77.7335 43.6773 56.7582 
   28.8    29.0    31.0    32.0    32.9    33.4    34.9    35.0    35.9 
56.8201 56.8201 56.8623 34.8454 77.7142 56.7579 56.7730 56.8083 56.8082 
   36.5    37.0    37.0    37.6    38.7    39.7    40.2    40.3    40.4 
56.8082 51.4081 34.8454 22.5481 56.8309 77.7337 33.4587 56.7823 56.7580 
   40.7    41.9    42.1    42.3    43.6    44.0    45.6    45.7    45.8 
56.7579 85.7710 58.8386 34.7107 56.7731 22.5471 77.7142 56.7580 56.7731 
   45.9    46.5    46.6    47.3    47.6    47.8    48.4    49.7    50.0 
85.7710 43.3394 39.5040 22.5472 60.2709 51.4079 90.6837 51.4080 77.7338 
   50.1    50.2    51.0    51.1    51.2    51.5    51.6    51.9    52.0 
56.7222 77.7338 77.7138 56.7582 22.5471 43.5670 22.4609 22.5483 92.7340 
   52.4    52.4    52.8    52.9    53.6    53.6    54.2    54.4    54.6 
77.8008 34.7107 90.7767 43.7657 22.4609 43.7657 77.7138 33.6737 43.7658 
   54.9    55.1    55.7    55.8    56.3    56.7    57.0    57.2    57.4 
77.7337 77.8007 43.5670 56.7222 56.8623 33.4587 43.7658 43.6143 22.5481 
   57.6    57.6    58.0    58.0    59.1    59.6    59.9    60.0    60.3 
22.5491 43.6647 92.7340 39.5040 22.5489 22.5482 77.8008 22.5489 22.5491 
   61.0    61.3    61.3    61.5    63.5    64.5    65.2    65.5    66.0 
77.8007 22.5478 90.6837 90.7767 56.7730 33.6737 60.2709 22.5478 43.6143 
   66.4    68.0    68.2    68.4    68.8    69.3    69.8    71.0    75.9 
22.5472 22.5482 22.4593 43.6771 22.5483 43.6647 22.4593 43.6771 51.4081 
   76.2    77.0    77.9    79.4    80.3    81.5    85.0    88.8   103.7 
51.4080 51.4079 43.3394 
  119.5   122.6   124.1 

2.3 Distribution of expression levels among samples

Next, we will look at the distribution of expression values per sample in terms of logarithmic CPM units. We display tumor and normal samples separately, and are shown in Figure 2. Each colored line in the graphs below represent a sample. In both graphs, we see two modes. The low mode represents genes not expressed in that sample. The second mode represents the genes that are expressed.

Non-parametric density distribution of expression profiles per sample.

Figure 2: Non-parametric density distribution of expression profiles per sample

In both graphs, we see some genes that deviate a little from the rest. These could be potential outliers and we have noted them for later analysis.

2.4 Distribution of expression levels among genes

Next, we wanted to look at the distribution of expression levels across genes. To do this, we calculate the average expression per gene through all the samples. Figure 3 shows the distribution of those values.

Distribution of average expression level per gene.

Figure 3: Distribution of average expression level per gene

2.5 Filtering of lowly-expressed genes

RNA sequence expression profiles that come from lowly-expressed genes can lead to artifacts in downstream differential expression analyses. Thus, it is common to set a threshold and filter out genes that fall below this value. In the light of this plot above, we may consider a cutoff of 1 log CPM unit as minimum value of expression to select genes being expressed across samples. Using this cutoff we proceed to filter out lowly-expressed genes. Now, we have a total of 11872 genes and 102 samples.

mask <- avgexp > 1
dim(se)
[1] 20115   102
se.filt <- se[mask, ]
dim(se.filt)
[1] 11872   102
dge.filt <- dge[mask, ]
dim(dge.filt)
[1] 11872   102

We then store the un-normalized versions of the filtered expression data.

saveRDS(se.filt, file.path("results", "se.paired.filt.unnorm.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.filt.unnorm.rds"))

2.6 Normalization

Next, we calculate the normalization factors on the filtered expression data set using the calcNormFactors function.

dge.filt <- calcNormFactors(dge.filt)

Replace the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.

assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)

Store normalized versions of the filtered expression data.

saveRDS(se.filt, file.path("results", "se.paired.filt.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.filt.rds"))

2.7 MA-plots

We examine now the MA-plots of the normalized expression profiles. These plots allow us to visualize how one sample compares to the average of the rest of the samples. We look first to the tumor samples in Figure 4.

MA-plots of the tumor samples.

Figure 4: MA-plots of the tumor samples

In the MA-plots of the tumor samples, we observed the following:

  • 3 samples with high differences from the mean, with a threshold ( M > 2 or M < -2). Their MA plots have the regression line in red.
  • 14 samples with moderate differences from the mean, which we define as ( M <= 2 and M >= 1 or M >=-2 and M <= -1). Their regression line is brown.
  • 34 samples with discrete differences from the mean as they are (M < 1 or M > -1). Their regression line is green.

We observe that the appearance of samples that differ from the mean can be problematic. If during downstream analysis we observe unexpected results, those samples should be removed together with their normal pairs. We can obtain their names with the following code:

big_c
[1] "TCGA.56.7823" "TCGA.56.8623" "TCGA.77.7138"

Next, we look now to the normal samples in Figure 5.

MA-plots of the normal samples.

Figure 5: MA-plots of the normal samples

In this case, we observed:

  • 0 samples with high differences from the mean, with a threshold ( M > 2 or M < -2). Their MA plots have the regression line in red.
  • 0 samples with moderate differences from the mean, which we define as ( M <= 2 and M >= 1 or M >=-2 and M <= -1). Their regression line is brown.
  • 51 samples with discrete differences from the mean as they are (M < 1 or M > -1). Their regression line is green.

In conclusion, we do not observe either important expression-level dependent biases among the normal samples.

2.8 Batch identification

Batch effects are sub-groups of measurements that have a qualitatively different behavior across conditions and are unrelated to the variables in the study. It is important to determine if batch effects are present to know if any confounding variables are affecting the data. We will search now for potential surrogate of batch effect indicators within normal and tumor samples. Given that each sample names corresponds to a TCGA barcode (see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview we are going to derive different elements of the TCGA barcode and examine their distribution across samples.

tss <- substr(colnames(se.filt), 6, 7)
table(data.frame(TYPE=se.filt$type, TSS=tss))
        TSS
TYPE     22 33 34 39 43 51 56 58 60 77 85 90 92
  normal 10  2  2  1  8  3 12  1  1  7  1  2  1
  tumor  10  2  2  1  8  3 12  1  1  7  1  2  1
center <- substr(colnames(se.filt), 27, 28)
table(data.frame(TYPE=se.filt$type, CENTER=center))
        CENTER
TYPE     07
  normal 51
  tumor  51
plate <- substr(colnames(se.filt), 22, 25)
table(data.frame(TYPE=se.filt$type, PLATE=plate))
        PLATE
TYPE     0980 1100 1635 1758 1820 1858 1949 2045 2125 2187 2247 2296 2326
  normal    0    0    5    4    7    1    4   10   10    2    4    2    1
  tumor     1    3    6    0    7    0    4   10   10    2    4    2    1
        PLATE
TYPE     A28V
  normal    1
  tumor     1
portionanalyte <- substr(colnames(se.filt), 18, 20)
table(data.frame(TYPE=se.filt$type, PORTIONANALYTE=portionanalyte))
        PORTIONANALYTE
TYPE     01R 11R 21R 31R 41R
  normal  48   3   0   0   0
  tumor   11  28   8   2   2
samplevial <- substr(colnames(se.filt), 14, 16)
table(data.frame(TYPE=se.filt$type, SAMPLEVIAL=samplevial))
        SAMPLEVIAL
TYPE     01A 01B 11A
  normal   0   0  51
  tumor   50   1   0

From this information we can make the following observations:

  • All samples were sequenced at the same center
  • Tumor and normal samples were collected in the same vial except one:
colnames(se.filt)[samplevial == "01B"]
[1] "TCGA.56.7823.01B.11R.2247.07"
  • Samples were collected evenly across different tissue source sites (TSS).
  • Samples were mostly well distributed in different plates.
  • Samples of normal cells concentrate into two different portion of analyte, while the tumor samples have five different portions.

With those results, we use the portion of analyte as surrogate of batch effect indicator. Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with portion of analyte.

table(data.frame(TYPE=se.filt$type, PORTIONANALYTE=portionanalyte))
        PORTIONANALYTE
TYPE     01R 11R 21R 31R 41R
  normal  48   3   0   0   0
  tumor   11  28   8   2   2

As we can see here, the majority of normal samples (48) come from the “01R”" analyte and 3 come from the “11R” analyte. However, the tumor samples span across five different analytes. This could potentially lead to a batch effect.

In order to examine how the samples group together, we use two approaches: hierarchical clustering and multidimensional scaling, annotating the outcome of interest and the surrogate of batch indicator, or portion of analyte in our case.

We perform this under a subset of our samples, as we wish to have a matrix without zeros. We remove samples belonging to portions 21R, 31R and 41R; and as the experiment is paired, also their pairs.

por1 <- as.vector(substr(colnames(se.filt)[portionanalyte == "41R"], 9, 12))
por2 <- as.vector(substr(colnames(se.filt)[portionanalyte == "31R"], 9, 12))
por3 <- as.vector(substr(colnames(se.filt)[portionanalyte == "21R"], 9, 12))
allpor <- c(por1, por2, por3)

table(is.element(substr(colnames(se.filt), 9, 12), allpor))

FALSE  TRUE 
   78    24 
se.batch <- se.filt[ ,!is.element(substr(colnames(se.filt), 9, 12), allpor)]
dge.batch <- DGEList(counts=assays(se.batch)$counts, genes=mcols(se.batch))

new.portionanalyte <- substr(colnames(se.batch), 18, 20)
table(data.frame(TYPE=se.batch$type, PORTIONANALYTE=new.portionanalyte))
        PORTIONANALYTE
TYPE     01R 11R
  normal  36   3
  tumor   11  28
dim(dge.batch)
[1] 11872    78

We see that we are able to remove 24 samples, which correspond to the 12 in the portions we want to remove and their pairs.

We calculate again log CPM values with a higher prior count to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure 6.

logCPM <- cpm(dge.batch, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(new.portionanalyte))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.batch)
outcome <- paste(substr(colnames(se.batch), 9, 12), as.character(se.batch$type), sep="-")
names(outcome) <- colnames(se.batch)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))), fill=sort(unique(batch)))
Hierarchical clustering of the samples.

Figure 6: Hierarchical clustering of the samples

We can observe that samples cluster primarily by sample type, tumor or normal. We observe that the different portions are present in both clusters, so we don’t consider that it is inducing batch effect.

We also see that one of the tumor samples, 8623-tumor, is present in the normal samples cluster. This one also had a strong deviation in the MA plot, so we might consider discarding it.

In Figure 7 we show the corresponding MDS plot. Here we see more clearly that the first source of variation separates tumor from normal samples, and we don’t see any sample that is separated from any cluster.

plotMDS(dge.batch, labels=outcome, col=batch)
legend("bottomleft", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))),
       fill=sort(unique(batch)), inset=0.05)
Multidimensional scaling plot of the samples.

Figure 7: Multidimensional scaling plot of the samples

Finally, under the consideration that there is no batch effect, the 24 samples that were removed in order to make the dendogram and the MDS plot will remain for further analysis. We decide to generate the dendogram in Figure 8 again in order to see if any other sample appears in its opposite cluster, apart from 8623-tumor.

logCPM <- cpm(dge.filt, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(portionanalyte))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.filt)
outcome <- paste(substr(colnames(se.filt), 9, 12), as.character(se.filt$type), sep="-")
names(outcome) <- colnames(se.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))), fill=sort(unique(batch)))
Hierarchical clustering of the samples.

Figure 8: Hierarchical clustering of the samples

We observe that only the 8623-tumor sample doesn’t cluster with the rest of the tumor samples.

2.9 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] ca_ES.UTF-8/ca_ES.UTF-8/ca_ES.UTF-8/C/ca_ES.UTF-8/ca_ES.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] geneplotter_1.60.0          annotate_1.60.1            
 [3] XML_3.98-1.19               AnnotationDbi_1.44.0       
 [5] lattice_0.20-38             edgeR_3.24.3               
 [7] limma_3.38.3                SummarizedExperiment_1.12.0
 [9] DelayedArray_0.8.0          BiocParallel_1.16.6        
[11] matrixStats_0.54.0          Biobase_2.42.0             
[13] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[15] IRanges_2.16.0              S4Vectors_0.20.1           
[17] BiocGenerics_0.28.0         knitr_1.22                 
[19] BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1             RColorBrewer_1.1-2     compiler_3.5.3        
 [4] BiocManager_1.30.4     highr_0.8              XVector_0.22.0        
 [7] bitops_1.0-6           tools_3.5.3            zlibbioc_1.28.0       
[10] bit_1.1-14             digest_0.6.18          memoise_1.1.0         
[13] RSQLite_2.1.1          evaluate_0.13          Matrix_1.2-17         
[16] DBI_1.0.0              yaml_2.2.0             xfun_0.6              
[19] GenomeInfoDbData_1.2.0 stringr_1.4.0          bit64_0.9-7           
[22] locfit_1.5-9.1         grid_3.5.3             rmarkdown_1.12        
[25] bookdown_0.9           blob_1.1.1             magrittr_1.5          
[28] codetools_0.2-16       htmltools_0.3.6        xtable_1.8-4          
[31] KernSmooth_2.23-15     stringi_1.4.3          RCurl_1.95-4.12       

3 Differential expression

We perform a simple examination of expression changes and their associated p-values using the R/Bioconductor package sva.

In order to use the method, a matrix of the model, including adjustment variables, and a null model which only includes the adjustment variables have to be created. In our case there are no factors to adjust, so we give ~1 as the intercept term in the model.

library(sva)
mod <- model.matrix(~type + 1, data = colData(se.filt))
mod0 <- model.matrix(~1, data = colData(se.filt))
sv <- sva(assays(se.filt)$logCPM, mod, mod0)
Number of significant surrogate variables is:  22 
Iteration (out of 5 ):1  2  3  4  5  
pv <- f.pvalue(assays(se.filt)$logCPM, mod, mod0)
sum(p.adjust(pv, method="fdr") < 0.01)
[1] 8477

There are 8477 genes changing significantly their expression at FDR < 1%. In Figure 9 below we show the distribution of the resulting p-values. This distribution should be uniform with a peak at low p-values for the truly DE genes, assuming that most genes are not differentially expressed.

Distribution of raw p-values for an F-test on every gene between tumor and normal samples.

Figure 9: Distribution of raw p-values for an F-test on every gene between tumor and normal samples

The sva() function is able to estimate surrogate variables, which have continuous values that can be used to correct and adjust unmeasured and unwanted effects.

sv <- sva(assays(se.filt)$logCPM, mod, mod0)
Number of significant surrogate variables is:  22 
Iteration (out of 5 ):1  2  3  4  5  
sv$n
[1] 22

The SVA algorithm has found 22 surrogate variables. Let’s use them to assess againt the extent of differential expression this time adjusting for these surrogate variables.

modsv <- cbind(mod, sv$sv)
mod0sv <- cbind(mod0, sv$sv)
pvsv <- f.pvalue(assays(se.filt)$logCPM, modsv, mod0sv)
sum(p.adjust(pvsv, method="fdr") < 0.01)
[1] 9467

We have increased the number of changing genes to 9467. Figure 10 shows the resulting distribution of p-values.

Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA.

Figure 10: Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA

3.1 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] ca_ES.UTF-8/ca_ES.UTF-8/ca_ES.UTF-8/C/ca_ES.UTF-8/ca_ES.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] sva_3.30.1                  genefilter_1.64.0          
 [3] mgcv_1.8-28                 nlme_3.1-139               
 [5] geneplotter_1.60.0          annotate_1.60.1            
 [7] XML_3.98-1.19               AnnotationDbi_1.44.0       
 [9] lattice_0.20-38             edgeR_3.24.3               
[11] limma_3.38.3                SummarizedExperiment_1.12.0
[13] DelayedArray_0.8.0          BiocParallel_1.16.6        
[15] matrixStats_0.54.0          Biobase_2.42.0             
[17] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[19] IRanges_2.16.0              S4Vectors_0.20.1           
[21] BiocGenerics_0.28.0         knitr_1.22                 
[23] BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.1         xfun_0.6               splines_3.5.3         
 [4] htmltools_0.3.6        yaml_2.2.0             blob_1.1.1            
 [7] survival_2.44-1.1      DBI_1.0.0              bit64_0.9-7           
[10] RColorBrewer_1.1-2     GenomeInfoDbData_1.2.0 stringr_1.4.0         
[13] zlibbioc_1.28.0        codetools_0.2-16       evaluate_0.13         
[16] memoise_1.1.0          highr_0.8              Rcpp_1.0.1            
[19] xtable_1.8-4           BiocManager_1.30.4     XVector_0.22.0        
[22] bit_1.1-14             digest_0.6.18          stringi_1.4.3         
[25] bookdown_0.9           grid_3.5.3             tools_3.5.3           
[28] bitops_1.0-6           magrittr_1.5           RCurl_1.95-4.12       
[31] RSQLite_2.1.1          Matrix_1.2-17          rmarkdown_1.12        
[34] compiler_3.5.3        

4 Functional analysis

Here we will perform the functional analysis in the second part of the project.

4.1 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] ca_ES.UTF-8/ca_ES.UTF-8/ca_ES.UTF-8/C/ca_ES.UTF-8/ca_ES.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] sva_3.30.1                  genefilter_1.64.0          
 [3] mgcv_1.8-28                 nlme_3.1-139               
 [5] geneplotter_1.60.0          annotate_1.60.1            
 [7] XML_3.98-1.19               AnnotationDbi_1.44.0       
 [9] lattice_0.20-38             edgeR_3.24.3               
[11] limma_3.38.3                SummarizedExperiment_1.12.0
[13] DelayedArray_0.8.0          BiocParallel_1.16.6        
[15] matrixStats_0.54.0          Biobase_2.42.0             
[17] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[19] IRanges_2.16.0              S4Vectors_0.20.1           
[21] BiocGenerics_0.28.0         knitr_1.22                 
[23] BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1             RColorBrewer_1.1-2     compiler_3.5.3        
 [4] BiocManager_1.30.4     XVector_0.22.0         bitops_1.0-6          
 [7] tools_3.5.3            zlibbioc_1.28.0        bit_1.1-14            
[10] digest_0.6.18          memoise_1.1.0          evaluate_0.13         
[13] RSQLite_2.1.1          Matrix_1.2-17          DBI_1.0.0             
[16] yaml_2.2.0             xfun_0.6               GenomeInfoDbData_1.2.0
[19] stringr_1.4.0          bit64_0.9-7            locfit_1.5-9.1        
[22] grid_3.5.3             survival_2.44-1.1      rmarkdown_1.12        
[25] bookdown_0.9           blob_1.1.1             magrittr_1.5          
[28] splines_3.5.3          htmltools_0.3.6        xtable_1.8-4          
[31] stringi_1.4.3          RCurl_1.95-4.12       

5 References